A couple of months ago we launched a beta of the new Flash template. Despite being quite ugly, it worked for the most part, apart from one major issue: a small percentage of the animated GIFs were not being converted to SWF properly. We are using a proprietary piece of software to do the conversion, and it was throwing an error claiming the files had more than 16,000 frames. After a great deal of back and forth with the company that makes the software, we managed to get the source code for the animated GIF portion.
davidc skillfully tracked down the bug and also pointed me towards this terrific article on GIF frame delays, which explains why there are so many browser-dependent synch problems with animated YTMNDs. At any rate, the Flash template will now force a 100ms delay when a 0ms delay is given. Some changes still need to be made to the code, but this is good news.
This also means that once everything is converted over, every browser will play the images at the same speed, hopefully ensuring the true end to all synch problems.
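The 0ms rule is simple enough to sketch in a few lines. This is only an illustration in Python, not the converter's actual code (which is proprietary); the function name and the list-of-milliseconds input are assumptions made for the example.

```python
# Hypothetical sketch of the "force 100ms when 0ms is given" rule.
FORCED_DELAY_MS = 100  # the value stated in the post


def normalize_frame_delays(delays_ms):
    """Return a copy of delays_ms with every 0ms delay replaced by 100ms."""
    return [FORCED_DELAY_MS if delay == 0 else delay for delay in delays_ms]


# Frames that declare no delay get the forced value; the rest are untouched.
print(normalize_frame_delays([0, 40, 0, 100]))
```

Because every frame ends up with an explicit non-zero delay, the player no longer has to pick its own default for 0ms frames, which is where browsers tend to disagree.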
Wednesday, April 15, 2009
Monday, February 23, 2009
Contests/Client-side data storage
Haven't been updating this as often as I'd planned, as the response was pretty abysmal. At any rate, there is a good deal of development news. I have found a new designer who seems pretty excited about coming up with a new design, which I'm pretty happy about. In other news, I've been working on finally implementing contests. Hopefully the first incarnation of this feature will show up on the site on Wednesday.
One of my favorite (and simplest) features on the site is the comment highlighting, which changes the background of a comment on a news post or site profile if it was made recently. This is a feature I've wanted to expand on greatly. If we could track the last visit time for each site, site profile, user profile, etc., we could highlight content in many ways that I think would be fantastic.
If we tracked a user's last visit to each of those areas we could:
* Highlight sites which have changed their image/sound since the last time you looked at them.
* Highlight comments on site profiles which were made since your last visit (as opposed to the current behavior, which only covers the last hour)
* Highlight sites a user has made since the last time you visited their profile.
* Filter sites from search results and basically anywhere else on the site, allowing you to not see sites you've already viewed, or for that matter highlight sites you have already seen.
This data has a great deal of possibilities, but at the same time would require a great deal of server resources to store and manage.
The total number of rows would be around (Users * (Sites * 2)) + (Users * Users). That would result in hundreds of billions of rows of data assuming a high rate of activity. This has been a design hurdle that has plagued me for years, long before YTMND existed.
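To get a feel for how quickly that formula grows, here is the arithmetic as a small Python sketch. The user and site counts are made-up round numbers picked only to show the order of magnitude; they are not real YTMND statistics.

```python
def estimated_rows(users, sites):
    """Row estimate from the post: (Users * (Sites * 2)) + (Users * Users)."""
    return (users * (sites * 2)) + (users * users)


# Hypothetical inputs: half a million users and half a million sites.
print(f"{estimated_rows(500_000, 500_000):,}")  # 750,000,000,000
```

The real table would only hold rows for pairs that have actually been visited, so this is the ceiling the "high rate of activity" case approaches, not a guaranteed size.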
I've been giving it some thought recently and I have a plan to try to tackle this problem. If I used something like Google Gears, I could have a client-side database for each user to store their data, and then just use JavaScript to do all the filtering. I've even been thinking that for select (paid?) users I could make a SQLite db for each user, and then when they move to a new computer I could re-synch it back to them. At any rate, something fun to think about. More soon.
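As a rough sketch of what such a per-user store might hold, here is the last-visit idea using Python's built-in sqlite3 module as a stand-in for a Gears-style client-side database. The table name, column names, helper functions, and the use of integer timestamps are all assumptions for illustration, not an existing YTMND schema.

```python
import sqlite3


def open_visit_db(path=":memory:"):
    """Open (or create) one user's private last-visit database."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS last_visit ("
        " kind TEXT NOT NULL,"          # e.g. 'site', 'site_profile', 'user_profile'
        " item_id INTEGER NOT NULL,"
        " visited_at INTEGER NOT NULL,"  # Unix timestamp of the latest visit
        " PRIMARY KEY (kind, item_id))"
    )
    return db


def record_visit(db, kind, item_id, now):
    """Remember that this user looked at the item at time `now`."""
    db.execute(
        "INSERT OR REPLACE INTO last_visit (kind, item_id, visited_at)"
        " VALUES (?, ?, ?)",
        (kind, item_id, now),
    )
    db.commit()


def is_new_since_visit(db, kind, item_id, changed_at):
    """True if the item changed after the last visit, or was never visited."""
    row = db.execute(
        "SELECT visited_at FROM last_visit WHERE kind = ? AND item_id = ?",
        (kind, item_id),
    ).fetchone()
    return row is None or changed_at > row[0]


db = open_visit_db()
record_visit(db, "site", 42, now=1000)
print(is_new_since_visit(db, "site", 42, changed_at=1500))  # changed after the visit
print(is_new_since_visit(db, "site", 42, changed_at=500))   # changed before the visit
```

Since each user only ever stores rows for things they have actually looked at, the storage cost is spread across the clients instead of landing in one enormous server-side table.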
Sunday, October 26, 2008
Acoustic fingerprinting and sound origins
This task of going through the millions of sound and image files on YTMND has had me thinking a lot about meta-data that we could be pulling out of files and displaying for everyone to see. The obvious plan of action would be to grab ID3 and EXIF info out of uploaded files and put it in the database somewhere. One issue is that the majority of the files uploaded are edited, chopped, shopped etc, so most of the data (if any) would be useless or only partially relevant.
I spent some time looking into fuzzy image algorithms for the dupecheck system I designed a while back, but there didn't seem to be any open source method available that wasn't extremely simple proof-of-concept type code. It was clear from the research I did that implementing anything similar was beyond my existing math education and time constraints.
Recently I've found the Shazam application on the iPhone to be quite helpful. To summarize, you just hold your phone near a source of music, it records for around 8 seconds and then computes a hash somehow and sends it off to their servers where it tries to find a match. I've found the success rate to be fairly good, maybe 70% of the time (usually the more mainstream stuff).
There is also an open project called MusicBrainz which aims to be an open-source CD lookup/identification system and they've created some nice programs that use third party acoustic fingerprinting among other methods to identify MP3 files to add and correct meta-data. The most recent system they've started using for acoustic fingerprinting is MusicDNS which is proprietary but seems fairly young and approachable.
I was thinking about how a system like this could be beneficial to YTMND. Once I am feeling less pressured to get important things done, I'd like to play with the SDK of one of these services on some YTMND sound clips. At this point, many users go out of their way to put incorrect info in the origin fields, which I find disappointing. YTMND has been a great way to find new artists and music, so it would be nice to have this information automatically filled out for you.
I would be interested to see if anyone else has played with this sort of technology on non-standard music clips such as "mashups" or even artists like Girl Talk who use multiple clips of other songs in rapid succession. These days it is extremely common for music to use samples from other songs made over the last century. I imagine at this point it would be incredibly hard to accurately figure out the samples of every chunk used in a song or loop, but it would be pretty amazing to have some sort of tree-view that showed songs and what songs they sampled from etc.
Anyway, still waiting on that replacement power supply. Thanks to the IRS I don't currently have the monetary freedom to order a third replacement, so I am patiently harassing this company to ship the correct item.
Monday, September 22, 2008
Featured users algorithms
Some of you may have noticed the featured users list just changed, and there have been a few people either angry or confused as to why they were removed, and others asking about the algorithm, so I figured I'd write a quick post about it.
Featured users was an idea I came up with one night in response to ROY4L's thoughts on the motivation of users to produce quality content as well as my near-religious following of Clay Shirky's "A Group Is Its Own Worst Enemy" essay.
"Featured users" was supposed to provide two main benefits. The first was to find users who consistently make good content that we could highlight on the front page, in an effort to make the front page less overwhelming and less of a "treasure hunt". The second, and more often overlooked, was to create an incentive for the "cream of the crop" users to create new content and interact with the site more often.
The goal was (and still is) ambitious: figure out who the best content producers are mathematically. I was surprised that a simple algorithm could produce fairly good results. The last featured users list was roughly 98% generated by the algorithm, with the last 2% being me adding or removing users manually.
So here, for the first time, is a rundown of the incredibly simple featured users algorithm (originally called 'user score'):
(average_site_score * 1.2)
(average_number_of_votes * 0.23)
x (number_of_favorites * 0.43)
---------------------------------------
= Your dumb score.
We calculate a score for every user, then take the top score and convert it to 10000, and scale all other scores by the same ratio. An example of how this turns out shows that the weight is very light towards the top and heavy towards the bottom. Here are some sample results from an old run:
#1 nutnics 10000
#2 ROY4L 9947.16
#3 phaseblue 8328.45
#4 max 6374.33
#5 astuteNacute 5957.35
#6 krebstar 4685.71
#7 syncan 4451.45
#8 kingstefan 3757.21
#9 ALMusic 3620.85
#10 PCF 3549.24
From there, the scores went down drastically, because users near the top skew the results for everyone else. The requisite for getting on the "list" was a score over 200, which only roughly 300 people achieve.
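For anyone who wants to play with the numbers, here is a minimal Python sketch of the scoring and scaling steps. The weights, the 10000 top score, and the 200 cutoff come from the post; everything else is an assumption. In particular, the sketch assumes the three weighted terms are multiplied together (the layout above only shows an "x" on the last line), and the user names and stats are invented.

```python
SCORE_CUTOFF = 200  # threshold for making the list, from the post


def raw_score(average_site_score, average_number_of_votes, number_of_favorites):
    # ASSUMPTION: all three weighted terms are multiplied together.
    return (
        (average_site_score * 1.2)
        * (average_number_of_votes * 0.23)
        * (number_of_favorites * 0.43)
    )


def scaled_scores(stats_by_user):
    """Scale raw scores so the top user lands on exactly 10000."""
    raw = {user: raw_score(*stats) for user, stats in stats_by_user.items()}
    top = max(raw.values(), default=0)
    if top == 0:
        return {user: 0.0 for user in raw}
    return {user: score / top * 10000 for user, score in raw.items()}


def featured(stats_by_user):
    """Users whose scaled score is over the cutoff, best first."""
    scores = scaled_scores(stats_by_user)
    keep = [user for user, score in scores.items() if score > SCORE_CUTOFF]
    return sorted(keep, key=lambda user: scores[user], reverse=True)


# Invented stats: (average site score, average votes, number of favorites).
stats = {"alice": (4.0, 100, 50), "bob": (2.0, 50, 25), "carol": (1.0, 5, 2)}
print(scaled_scores(stats))
print(featured(stats))
```

Because everything is scaled against the single best raw score, one very strong user pushes everyone else down, which matches the steep drop-off in the sample results.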
At first, I used more data to base the scores on: number of comments (this is why whetstone made the list), number of views, etc. I then realized sites like "Blue Ball Machine" skewed the averages for everyone, so I tried calculating them with the top 5% of each user's sites excluded (trying to discard anomalies), but the results were still really off.
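The "exclude the top 5%" experiment amounts to a one-sided trimmed average. Here is a minimal Python sketch of that idea; the function name, the rounding-down of the number of sites to drop, and the sample view counts are assumptions, not the code that was actually run.

```python
def average_without_top(values, percent=5):
    """Average of values after dropping the highest `percent` percent."""
    if not 0 <= percent < 100:
        raise ValueError("percent must be in the range 0-99")
    if not values:
        return 0.0
    # ASSUMPTION: round the number of dropped items down, so small
    # lists lose nothing (e.g. 20 sites -> drop 1, 19 sites -> drop 0).
    drop = len(values) * percent // 100
    kept = sorted(values)[: len(values) - drop]
    return sum(kept) / len(kept)


# Nineteen ordinary sites plus one runaway hit.
views = list(range(1, 20)) + [1000]
print(sum(views) / len(views))     # plain average, dominated by the hit
print(average_without_top(views))  # average with the hit discarded
```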
The numbers don't lie. This algorithm is working on a large enough set of data that a few up-voting alts won't make a difference. More people are viewing, favoriting and voting on the featured users than on those who aren't featured (even if you use a time scale of a period before featured users existed).
Now the only thing that you can really muck with here is the weight on each piece of the algorithm. Normally, you can look at the results and change the algorithm to remove results you don't like or get results you do like, but a huge part of this is opinion based. Multiple people want DarthWang to be featured, but I think the problem is that you can't please everyone with a single list.
This time around, after I was persuaded to let it happen, BTape and Teknorat took the generated list and then added and removed people as they saw fit, which is what caused much of the recent change. So focus your rage towards them for the next couple weeks.
I also made a quick change to the featured users content box, which now filters out duplicate users, so at any one point in time you won't see more than one site by each user, which I think will deal with a lot of the spam issues.
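The duplicate filter boils down to "keep the first site per user". A minimal Python sketch, assuming the box starts from an ordered list of (user, site) pairs; the function name and the sample data are made up for the example.

```python
def one_site_per_user(sites):
    """Keep only the first site seen for each user, preserving order."""
    seen_users = set()
    unique = []
    for user, site in sites:
        if user not in seen_users:
            seen_users.add(user)
            unique.append((user, site))
    return unique


box = [("max", "site-a"), ("max", "site-b"), ("syncan", "site-c"), ("max", "site-d")]
print(one_site_per_user(box))
```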
anyway, back to work, dongs.
Thursday, September 18, 2008
Asset conversion
One of the problems with moving over to Flash is that we can only import a finite set of file types at run-time. Flash doesn't support WAV or OGG or animated GIFs natively, so part of the move is to clean out the asset system and convert file types where needed.
Originally, when YTMND was created, there was no file type checking at all; people could basically upload anything and we would save it. Then we added simple file name checks, which removed a good portion of the misclicks, but people who had the proper file endings with invalid files still got through. The next variation of checks used mime-magic to actually check the file types to see if they were valid, but with the millions of different possibilities when encoding sound and images, mime-magic still wasn't perfect. In addition to that, the original mime-magic setup checked if the user uploaded a sound and an image, but didn't make sure they put them in the right fields.
Today, there are literally thousands of sites using images as sounds and vice versa, and hundreds of sites using files we don't support like MIDI, OGG, etc. Hundreds more use Word documents, zip files, executables, MPEG clips, etc. So far I've found some pretty interesting file types; once I do an initial pass on the whole file system I will post statistics.
Now that I am at the stage where we need to convert the entire site over to work in Flash, I have written a much more thorough file type checking routine. One of the nice things about the fact that we are forced to do conversion on a lot of assets is that once we are in there converting, adding new file types will be easy. Hundreds of people have uploaded MIDI as sound files on their sites, which currently don't work at all. In the new conversion routine, MIDI files are converted to WAV and then compressed to MP3, so we can allow people to upload MIDI files if they want.
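To illustrate the general idea of checking content instead of file names, here is a small Python sketch that sniffs a type from a file's leading bytes and then checks that it landed in the right field. It is not the routine described above: the function names, the short list of signatures, and the grouping into image and sound kinds are assumptions, and a thorough check would need to look well past the first few bytes.

```python
IMAGE_KINDS = {"gif", "jpeg", "png"}
SOUND_KINDS = {"wav", "mp3", "ogg", "midi"}
KINDS_BY_FIELD = {"image": IMAGE_KINDS, "sound": SOUND_KINDS}


def sniff_type(data):
    """Guess a file type from its leading bytes; return None if unknown."""
    if data[:6] in (b"GIF87a", b"GIF89a"):
        return "gif"
    if data[:3] == b"\xff\xd8\xff":
        return "jpeg"
    if data[:8] == b"\x89PNG\r\n\x1a\n":
        return "png"
    if data[:4] == b"RIFF" and data[8:12] == b"WAVE":
        return "wav"
    if data[:4] == b"OggS":
        return "ogg"
    if data[:4] == b"MThd":
        return "midi"
    # MP3: either an ID3v2 tag or an MPEG audio frame sync (11 set bits).
    if data[:3] == b"ID3" or (
        len(data) >= 2 and data[0] == 0xFF and (data[1] & 0xE0) == 0xE0
    ):
        return "mp3"
    return None


def fits_field(data, field):
    """True if the sniffed type belongs in the given upload field."""
    return sniff_type(data) in KINDS_BY_FIELD[field]


print(sniff_type(b"GIF89a\x01\x00\x01\x00"))          # gif
print(fits_field(b"GIF89a\x01\x00\x01\x00", "sound"))  # False: image in the sound field
print(sniff_type(b"PK\x03\x04"))                       # None: a zip file
```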
The new system will basically be backwards compatible with the old system, so SWF versions of WAV files are just treated as children of the original asset, i.e. the original asset is archived. This means if we allowed MIDI files, they would be converted to MP3 to be heard in Flash, but users would still be able to download the original MIDI file from the site's asset pages.
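The parent/child relationship can be pictured with a tiny data model. This Python sketch is only a guess at the shape of it: the class, field, and method names are invented, and it assumes each converted file is recorded as a child of the file it was converted from.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False)  # compare assets by identity; avoids parent/child recursion
class Asset:
    name: str
    file_type: str
    parent: Optional["Asset"] = field(default=None, repr=False)
    children: List["Asset"] = field(default_factory=list)

    def add_converted(self, file_type):
        """Record a converted copy of this asset as one of its children."""
        stem = self.name.rsplit(".", 1)[0]
        child = Asset(f"{stem}.{file_type}", file_type, parent=self)
        self.children.append(child)
        return child

    def original(self):
        """Walk up to the archived upload that everything was derived from."""
        asset = self
        while asset.parent is not None:
            asset = asset.parent
        return asset


# The MIDI case from the post: MIDI -> WAV -> MP3, original kept for download.
midi = Asset("song.mid", "midi")
mp3 = midi.add_converted("wav").add_converted("mp3")
print(mp3.name, "plays in Flash;", mp3.original().name, "is the download")
```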
So what sort of files do you think it would be beneficial to add to the allowed upload list? OGG? MOD? NSF or all the other 8 bit music types? What about image types?
I have thought about allowing SWF/FLV since a lot of people make their animated GIFs in Flash, but not only do I think the compression requires a human hand, I am concerned that YTMND would become even more of a YouTube clone than it is now.
Anyway that's all for now. Post a comment if you have any ideas or questions or hit up the IRC if you want to chat about it.