Open Sourcing COVID-19 Data with Cindy Wang & Gil Yehuda
June 26th, 2020
46 mins 45 secs
About this Episode
Sponsored By Linode
Justin Dorfman | Eric Berry | Richard Littauer
Cindy Wang, Sr. Director, Product Management, Yahoo Knowledge Graph
Gil Yehuda, Sr. Director of Open Source
Hello and welcome to Sustain! In this episode, we have special guests Gil Yehuda and Cindy Wang, who both work for Verizon Media, which is a combination of a bunch of companies, predominantly Yahoo and AOL. Gil is Senior Director of the Open Source Program, and Cindy is Sr. Director, Product Management, Yahoo Knowledge Graph. We learn more about Gil and Cindy’s positions with Yahoo, the Yahoo Knowledge Graph COVID-19 project, data sets, complications with data, and Vespa (an open source big data serving engine).
[00:02:26] Gil explains to us what coverage he has and what he’s responsible for in his OSPO (Open Source Program Office). He also tells us how many repos and orgs he’s managing.
[00:05:29] Cindy tells us all about the Yahoo Knowledge Graph COVID-19 project. Justin asks about the data sets and their inconsistencies, and Cindy explains.
[00:12:30] Eric asks Cindy whether this resource has been established as an authority and whether she has heard feedback from others, or seen others pointing to it as the authoritative data source.
[00:14:00] Gil explains to us two levels of complications with data that he’s observing.
[00:18:30] With regard to financial incentivisation, Eric asks what their experience has been and whether they have had any feedback from people who are trying to massage the numbers in their favor.
[00:21:22] Richard wants to know whether any of the code is open source and whether people can look at it. How can people get involved, and what was that process like beyond the data aspects? Gil also tells us whether he has had any pushback on making any of this open.
[00:29:01] Gil mentions Vespa.ai, an open source big data serving engine. Richard wonders whether Gil has thought about long-term plans for sustaining this work: how it will go forward, which teams will be on it, and whether it will only be open source for something like a year.
[00:31:57] Richard wonders whether Gil and Cindy have plans to onboard people from the community who are interested in the data and are helping out, so that they can also become maintainers and it’s not just an internal, Yahoo-only project.
[00:33:08] Eric asks a follow-up question, asking Gil to elaborate on his mention of using these tools internally. Cindy tells us all about the tools. Eric also wonders whether there were any questions or concerns about the open source licensing, and whether people are allowed to build commercial applications on top of this data.
[00:40:24] Gil and Cindy tell us where people can get involved in this project, how to follow along, and how to follow them.
- [00:42:20] Richard’s spotlight is Moment.js.
- [00:42:39] Eric’s spotlight is a project built by Jared White called Bridgetown, which is an updated version of Jekyll.
- [00:43:49] Justin’s spotlights are a thank-you to Ashley Wolf for putting this whole thing together and a browser extension called Read Aloud, a text-to-speech voice reader.
- [00:44:31] Gil’s spotlight is a project called Denali.
[00:04:51] “AOL had an OSPO and they didn’t have an OSPO and they kind of had an OSPO, but when we merged together we brought it together and we just continue to do what we do.”
[00:05:04] “Before OSPO there was Open Source activity because as you know companies do Open Source even without OSPOs. They just do Open Source better with OSPOs.”
[00:14:00] “There’s two levels of complications with data that I’m observing and there’s probably more, because there’s always more to everything.”
[00:14:48] “But then there’s this other element which is, I don’t know, maybe it’s the political nature of data.”
[00:16:23] “And I guess all of the paddling that goes on under the surface of the water to collect that data and to be as accurate as you can, but also to connect it to the source so that you could investigate it.”
[00:20:36] “The training set has to be clean, so they actually spend 80% of their effort in cleaning the data.”
[00:34:28] “So, for example, you look at some states now after opening, the numbers shot up. So, is it concerning from business planning perspective? Perhaps.”
[00:37:23] “We have hundreds of millions of entities in this graph that represent billions of pieces of information that we use across the company for all types of things, like how the news stream is ordered.”
Yahoo! Developer Network (YDN)
Yahoo! Developer Dash Open Podcast
- Produced by Justin Dorfman at CodeFund
- Edited by Paul M. Bahr at Peachtree Sound
- Show notes by DeAnn Bahr at Peachtree Sound
- Ad Sales by Eric Berry at CodeFund